Airbnb is a company that provides a platform through which users classified as homeowners can rent out rooms, or an entire house, to users classified as renters. This occurs primarily in high-tourism locations, but is also fairly prevalent in most large US cities. Airbnb receives a percentage of the revenue that rentals on its platform generate; it is therefore in the company's best interest to grow the number of rental properties available to its customers, and it would like to do so by identifying potential hosts and convincing them to list their properties. The company has already gathered a large amount of data on its current listings and has come to our group to analyze that data. It would like to provide hosts with suggestions for their property descriptions, which amenities to offer, etc., and use these values to predict ratings, listing prices, the booking percentage of available dates, and more. To pilot this idea, the company would like to focus on one city in the United States before rolling the predictive model out to other locations. The dataset used in this analysis comes from Inside Airbnb, specifically the Chicago, Illinois datasets. The company would like to use this analysis to determine the ideal property type, location, and amenities to look for in potential rental locations, and then predict how the host's actions (descriptions, response rate, etc.) would affect a listing's potential.
Hosts want to lease their properties on Airbnb, but they are unsure what prices to set for their new properties (of differing types). They want to build and use a statistical model to predict next year's Airbnb rental trends. As the economy grows each year, Airbnb is looking for a new price range for new properties. Hosts can use this price reference to decide whether to list their properties on Airbnb, and Airbnb can use the model to shape its marketing plan for recruiting its target partners (hosts).
Customers are looking for rental properties that fit their budget relative to the amenities offered at each location. They therefore want a reliable feedback and rating system that aligns with what they are willing to pay to stay there. Airbnb would use this model to offer customers the right properties, within their price range, to increase the chance of them using the platform. The model will be based on the feedback that customers provide after visiting a property, and this data will then be used to improve the recommendation system for future customers viewing that property.
As customers click on a property they are interested in, Airbnb also suggests similar properties they might like. This lets customers explore more options related to their search criteria and increases the chance that a customer will choose one of the options Airbnb suggests.
The main data is sourced from the Inside Airbnb website at: http://insideairbnb.com/get-the-data.html. This data was scraped from Airbnb listings and posted online by Inside Airbnb for public use.
City of Chicago latitude and longitude sourced from: https://www.latlong.net/place/chicago-il-usa-1855.html
Key information in the dataset includes, but is not limited to:
Listing ID, name, description, neighborhood overview, location (including longitude and latitude), and various descriptors (property type, beds, bathrooms, amenities, price, etc.)
Host ID, name, location, "about" description, response rate and time, acceptance rate, number of listings, whether a profile picture has been uploaded, and whether the host's identity has been verified
Review information including ratings on listing accuracy, cleanliness, location, etc.
In order to start the data analysis, we will need to import a variety of packages.
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import math
import statsmodels.api as sm
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from wordcloud import WordCloud, STOPWORDS
import folium
from folium import plugins
from folium.plugins import HeatMap
import json
from glob import glob
# Note: install the folium and wordcloud packages before executing the code.
We will read in the data, which was downloaded from Inside Airbnb, and look at its info.
# Read in data
dat = pd.read_csv('listings.csv')
dat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6366 entries, 0 to 6365
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            6366 non-null   int64  
 1   listing_url                                   6366 non-null   object 
 2   scrape_id                                     6366 non-null   float64
 3   last_scraped                                  6366 non-null   object 
 4   name                                          6366 non-null   object 
 5   description                                   6352 non-null   object 
 6   neighborhood_overview                         4663 non-null   object 
 7   picture_url                                   6366 non-null   object 
 8   host_id                                       6366 non-null   int64  
 9   host_url                                      6366 non-null   object 
 10  host_name                                     6365 non-null   object 
 11  host_since                                    6365 non-null   object 
 12  host_location                                 6355 non-null   object 
 13  host_about                                    4381 non-null   object 
 14  host_response_time                            5187 non-null   object 
 15  host_response_rate                            5187 non-null   object 
 16  host_acceptance_rate                          5381 non-null   object 
 17  host_is_superhost                             6365 non-null   object 
 18  host_thumbnail_url                            6365 non-null   object 
 19  host_picture_url                              6365 non-null   object 
 20  host_neighbourhood                            5869 non-null   object 
 21  host_listings_count                           6365 non-null   float64
 22  host_total_listings_count                     6365 non-null   float64
 23  host_verifications                            6366 non-null   object 
 24  host_has_profile_pic                          6365 non-null   object 
 25  host_identity_verified                        6365 non-null   object 
 26  neighbourhood                                 4664 non-null   object 
 27  neighbourhood_cleansed                        6366 non-null   object 
 28  neighbourhood_group_cleansed                  0 non-null      float64
 29  latitude                                      6366 non-null   float64
 30  longitude                                     6366 non-null   float64
 31  property_type                                 6366 non-null   object 
 32  room_type                                     6366 non-null   object 
 33  accommodates                                  6366 non-null   int64  
 34  bathrooms                                     0 non-null      float64
 35  bathrooms_text                                6362 non-null   object 
 36  bedrooms                                      5829 non-null   float64
 37  beds                                          6285 non-null   float64
 38  amenities                                     6366 non-null   object 
 39  price                                         6366 non-null   object 
 40  minimum_nights                                6366 non-null   int64  
 41  maximum_nights                                6366 non-null   int64  
 42  minimum_minimum_nights                        6366 non-null   int64  
 43  maximum_minimum_nights                        6366 non-null   int64  
 44  minimum_maximum_nights                        6366 non-null   int64  
 45  maximum_maximum_nights                        6366 non-null   int64  
 46  minimum_nights_avg_ntm                        6366 non-null   float64
 47  maximum_nights_avg_ntm                        6366 non-null   float64
 48  calendar_updated                              0 non-null      float64
 49  has_availability                              6366 non-null   object 
 50  availability_30                               6366 non-null   int64  
 51  availability_60                               6366 non-null   int64  
 52  availability_90                               6366 non-null   int64  
 53  availability_365                              6366 non-null   int64  
 54  calendar_last_scraped                         6366 non-null   object 
 55  number_of_reviews                             6366 non-null   int64  
 56  number_of_reviews_ltm                         6366 non-null   int64  
 57  number_of_reviews_l30d                        6366 non-null   int64  
 58  first_review                                  5282 non-null   object 
 59  last_review                                   5282 non-null   object 
 60  review_scores_rating                          5282 non-null   float64
 61  review_scores_accuracy                        5246 non-null   float64
 62  review_scores_cleanliness                     5246 non-null   float64
 63  review_scores_checkin                         5245 non-null   float64
 64  review_scores_communication                   5244 non-null   float64
 65  review_scores_location                        5245 non-null   float64
 66  review_scores_value                           5245 non-null   float64
 67  license                                       6031 non-null   object 
 68  instant_bookable                              6366 non-null   object 
 69  calculated_host_listings_count                6366 non-null   int64  
 70  calculated_host_listings_count_entire_homes   6366 non-null   int64  
 71  calculated_host_listings_count_private_rooms  6366 non-null   int64  
 72  calculated_host_listings_count_shared_rooms   6366 non-null   int64  
 73  reviews_per_month                             5282 non-null   float64
dtypes: float64(20), int64(20), object(34)
memory usage: 3.6+ MB
We can see that there are 74 columns and 6,366 observations; however, some columns contain missing values.
# Drop url columns
dat = dat.drop(['listing_url','host_url','host_thumbnail_url','host_picture_url','picture_url'], axis = 1)
# Drop empty columns
dat = dat.drop(['neighbourhood_group_cleansed','bathrooms','calendar_updated'], axis = 1)
# Drop rest
dat = dat.drop(['neighbourhood','host_listings_count','host_total_listings_count','scrape_id','calendar_last_scraped'], axis = 1)
Let us check for duplicate values and columns.
print ("Dataframe shape prior to removing duplicates: " + str(dat.shape))
dat = dat.drop_duplicates()
dat = dat.loc[:,~dat.columns.duplicated()]
print ("Dataframe shape after removing duplicates: " + str(dat.shape))
Dataframe shape prior to removing duplicates: (6366, 61)
Dataframe shape after removing duplicates: (6366, 61)
We can see that the dataframe did not include any duplicate observations.
dat['bathrooms_text'].sample(10)
897     1 shared bath
4396        1.5 baths
5453           1 bath
1680           1 bath
2164          2 baths
328           2 baths
5844          2 baths
4855        2.5 baths
4757    1 shared bath
1199           1 bath
Name: bathrooms_text, dtype: object
Looking at the 'bathrooms_text' column, we see that it is not very usable in its current state. We will split the bathroom text column into two: one containing a float variable for the number of bathrooms, and the other an additional descriptor of the bathroom (shared/private).
# First let us make all text lowercase to simplify string manipulation
dat['bathrooms_text'] = dat['bathrooms_text'].str.lower()
# Next we must convert any text 'half' to 0.5 so it is included in the subsequent number extraction
dat['bathrooms_text'] = dat['bathrooms_text'].str.replace(r'(half)+','0.5', regex = True)
# Then extract the numbers into the new 'bathrooms' float32 data type column
dat['bathrooms'] = dat['bathrooms_text'].str.extract(r'(\d+\.?\d*)', expand = True).astype('float')
# This leaves us with only float and NaN values
dat['bathrooms'].unique()
array([ 1. , 2. , 1.5, 3. , 2.5, 0. , 3.5, 11. , 5. , nan, 0.5,
4. , 4.5, 11.5, 6.5, 7. , 5.5, 6. , 8. , 12.5, 10. ])
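As a quick sanity check, the same lowercase/replace/extract pipeline can be exercised on a few made-up strings (hypothetical inputs, not drawn from the dataset):

```python
import re

# Hypothetical sample strings mimicking 'bathrooms_text' values
samples = ['1 shared bath', '2.5 baths', 'Half-bath']
# Lowercase, substitute 'half' with '0.5', then extract the number
lowered = [s.lower().replace('half', '0.5') for s in samples]
nums = [float(re.search(r'\d+\.?\d*', s).group()) for s in lowered]
print(nums)  # [1.0, 2.5, 0.5]
```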
Above are the unique values left for the float variable in the column 'bathrooms'. The text portion requires a little more refining:
# Remove the text 'bath', unnecessary symbols and whitespace, then extract the leftover text
dat['bathrooms_text'] = dat['bathrooms_text'].str.replace(r'baths?','', regex = True)
dat['bathrooms_text'] = dat['bathrooms_text'].str.replace(r' +|\.+|\-+','', regex = True)
dat['bathrooms_text'] = dat['bathrooms_text'].str.extract(r'(\D+)')
dat['bathrooms_text'].count()
1587
dat['bathrooms_text'].unique()
array(['shared', nan, 'private'], dtype=object)
This leaves us with only 1,587 observations that contain one of the bathroom descriptors, 'shared' or 'private'; the rest are missing values since the original data did not contain text for them.
Next, let's convert the datetime columns into the proper datatype.
# Convert dates to datetime data type
for x in ['last_scraped', 'host_since', 'first_review', 'last_review']:
dat[x] = pd.to_datetime(dat[x])
In order to see the length of time that a host has been active on the platform, we will create a new column called 'host_age' which we can visualize later. Note: this will also be converted to a float variable instead of a timedelta variable.
dat['host_age'] = (dat['last_scraped'] - dat['host_since'])/pd.to_timedelta(1, unit='D')
dat['host_age'].sample(5)
4195    1667.0
4533    1533.0
2241    1146.0
417     2244.0
783     2151.0
Name: host_age, dtype: float64
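The division by `pd.to_timedelta(1, unit='D')` is what turns the timedelta into a plain float number of days; a minimal illustration with made-up dates:

```python
import pandas as pd

# The difference between two Timestamps is a Timedelta...
delta = pd.Timestamp('2021-11-07') - pd.Timestamp('2021-01-01')
# ...and dividing by a one-day Timedelta yields a float number of days
days = delta / pd.to_timedelta(1, unit='D')
print(days)  # 310.0
```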
dat['host_response_rate'].sample(5)
4264     NaN
2058    100%
464      NaN
3068    100%
6230     97%
Name: host_response_rate, dtype: object
We can see that we will need to convert the percentage columns ('host_response_rate' and 'host_acceptance_rate') into float variables.
# Strip the percentage sign and convert host response rate and acceptance rate columns into float
dat['host_response_rate'] = dat['host_response_rate'].str.replace(r'(\D)','', regex = True).astype('float')/100
dat['host_acceptance_rate'] = dat['host_acceptance_rate'].str.replace(r'(\D)','', regex = True).astype('float')/100
dat['host_response_rate'].sample(5)
573     1.0
1662    1.0
850     1.0
4528    NaN
5544    1.0
Name: host_response_rate, dtype: float64
dat['host_is_superhost'].unique()
array(['t', 'f', nan], dtype=object)
We will convert the 't' and 'f' values to binary float values in all of the boolean columns for later analysis, where 1 will mean "True".
# Map all boolean columns to float 1/0 values
dat['host_is_superhost'] = dat['host_is_superhost'].map({'t':1,'f':0}).astype('float')
dat['host_has_profile_pic'] = dat['host_has_profile_pic'].map({'t':1,'f':0}).astype('float')
dat['host_identity_verified'] = dat['host_identity_verified'].map({'t':1,'f':0}).astype('float')
dat['has_availability'] = dat['has_availability'].map({'t':1,'f':0}).astype('float')
dat['instant_bookable'] = dat['instant_bookable'].map({'t':1,'f':0}).astype('float')
dat['host_is_superhost'].unique()
array([ 1., 0., nan])
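The five `map` calls above could equivalently be written as a loop over the boolean columns; a small sketch on a toy frame (hypothetical data, same 't'/'f'/missing encoding as the listing data):

```python
import pandas as pd

# Toy frame standing in for the real listing columns
toy = pd.DataFrame({'host_is_superhost': ['t', 'f', None],
                    'instant_bookable': ['f', 't', 't']})
# Map each boolean column to float 1/0; missing values stay NaN
for col in ['host_is_superhost', 'instant_bookable']:
    toy[col] = toy[col].map({'t': 1, 'f': 0}).astype('float')
print(toy['host_is_superhost'].tolist())  # [1.0, 0.0, nan]
```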
Convert the price column to a float data type.
# Strip the currency symbol and thousands separators, then convert to float
dat['price'] = dat['price'].str.replace(r'[$,]','', regex = True).astype('float')
dat['price'].sample(5)
951     144.0
5994     59.0
4066    365.0
5953    386.0
3862    159.0
Name: price, dtype: float64
Since we will not be able to use the individual license numbers, we will convert this column into a categorical variable, where 1 means the listing has a license and 0 means it does not. For our purposes, we will treat a license that is still pending as no license (i.e. 0).
dat['license'].sample(10)
6322                          NaN
509                  R17000014262
2958                 R21000061382
3642                          NaN
3667                 R19000046650
1853                 R19000050608
4463             32+ Days Listing
4458    City registration pending
4850    City registration pending
3273    City registration pending
Name: license, dtype: object
dat['license'] = dat['license'].fillna(0)
dat.loc[dat['license'].str.contains('pending', na=False), 'license'] = 0
dat.loc[dat['license'] != 0, 'license'] = 1
dat['license'] = dat['license'].astype('float')
dat['license'].sample(10)
5019    0.0
216     1.0
1017    1.0
4334    1.0
6160    1.0
2914    1.0
4017    1.0
4799    0.0
5096    1.0
3967    1.0
Name: license, dtype: float64
Let us create an average rating column that includes the average of all review scores values.
dat['avg_rating'] = dat[['review_scores_rating','review_scores_accuracy','review_scores_cleanliness',
'review_scores_checkin','review_scores_communication','review_scores_location',
'review_scores_value']].mean(axis = 1)
dat['avg_rating'].describe()
count    5282.000000
mean        4.748542
std         0.514896
min         0.000000
25%         4.737143
50%         4.861429
75%         4.938571
max         5.000000
Name: avg_rating, dtype: float64
For the column 'host_response_time', we will create a new binary column representing whether a host responds within a day.
#Categorical within an hour,within a few hours, within a day to 1(true), NaN,a few days or more to 0(false)
dat['host_response_inADay'] = dat.host_response_time.map({'within an hour': 1,
'within a few hours': 1,
'within a day':1,
'a few days or more':0,
np.nan:0})
For the columns description, neighborhood_overview, host_location, host_about, and host_neighbourhood, we will fill missing values with 'Unknown', since these columns do not directly affect the project topic.
dat['description'].fillna(value='Unknown', inplace=True)
dat['neighborhood_overview'].fillna(value='Unknown', inplace=True)
dat['host_location'].fillna(value='Unknown', inplace=True)
dat['host_about'].fillna(value='Unknown', inplace=True)
dat['host_neighbourhood'].fillna(value='Unknown', inplace=True)
For the columns host_name, host_since, host_has_profile_pic, and host_identity_verified, the missing values fall in the same rows, so we drop those rows directly.
dat = dat.dropna(subset=['host_name'])
nullseries = dat.isnull().sum()
print(nullseries[nullseries > 0])
host_response_time             1178
host_response_rate             1178
host_acceptance_rate            984
bathrooms_text                 4779
bedrooms                        536
beds                             81
first_review                   1083
last_review                    1083
review_scores_rating           1083
review_scores_accuracy         1119
review_scores_cleanliness      1119
review_scores_checkin          1120
review_scores_communication    1121
review_scores_location         1120
review_scores_value            1120
reviews_per_month              1083
bathrooms                         4
avg_rating                     1083
dtype: int64
For the columns bedrooms, beds, and bathrooms, we fill missing values with the mode; filling with the most common value keeps the result less biased.
print(dat['bedrooms'].mode())
print(dat['beds'].mode())
print(dat['bathrooms'].mode())
0    1.0
dtype: float64
0    1.0
dtype: float64
0    1.0
dtype: float64
dat['bedrooms'].fillna(value=1, inplace=True)
dat['beds'].fillna(value=1, inplace=True)
dat['bathrooms'].fillna(value=1, inplace=True)
For the columns first_review, last_review, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, and reviews_per_month, most missing values are caused by 'number_of_reviews' == 0, so we fill them with 0.0. The later review analysis will exclude these rows, since they have no reviews.
# first_review, last_review, will not be filled since no value suitable
dat['review_scores_rating'].fillna(value=0.0, inplace=True)
dat['review_scores_accuracy'].fillna(value=0.0, inplace=True)
dat['review_scores_cleanliness'].fillna(value=0.0, inplace=True)
dat['review_scores_checkin'].fillna(value=0.0, inplace=True)
dat['review_scores_communication'].fillna(value=0.0, inplace=True)
dat['review_scores_location'].fillna(value=0.0, inplace=True)
dat['review_scores_value'].fillna(value=0.0, inplace=True)
dat['reviews_per_month'].fillna(value=0.0, inplace=True)
nullseries = dat.isnull().sum()
print(nullseries[nullseries > 0])
host_response_time      1178
host_response_rate      1178
host_acceptance_rate     984
bathrooms_text          4779
first_review            1083
last_review             1083
avg_rating              1083
dtype: int64
Let us explore the data in the dataset.
First, how many unique hosts are there?
print("There are %s unique hosts in the dataset." %dat['host_id'].nunique())
There are 3370 unique hosts in the dataset.
How many listings does each host have in the Chicago area?
listings_by_host = dat['host_id'].value_counts()
listings_by_host.describe()
count    3370.000000
mean        1.888724
std         5.549723
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max       260.000000
Name: host_id, dtype: float64
listings_by_host[listings_by_host > 1].count()
799
Here we can see that out of the 3,370 unique hosts, 799 have more than one listing in the Chicago area. Interestingly, there is one host ID with 260 listings.
Let's graph this data to see the distribution of hosts with differing numbers of listings.
plt.figure(figsize=(18,6))
p = sb.countplot(x=listings_by_host, order=sorted(listings_by_host.unique()))
p.set_xticklabels(labels=p.get_xticklabels(),rotation=90)
p.bar_label(p.containers[0])
plt.xlabel('Number of Listings by Host')
plt.show()
It is obvious that the vast majority of hosts have only one listing in the Chicago area. Let us look at the outliers.
plt.figure(figsize=(18,6))
sb.boxplot(x=dat['calculated_host_listings_count'])
plt.xlabel('Number of Properties Listed by the Host')
plt.show()
#Calculate the lower and upper quartile values
q1 = np.quantile(dat['calculated_host_listings_count'], 0.25)
q3 = np.quantile(dat['calculated_host_listings_count'], 0.75)
#calculate the upper whisker only, since no outliers are below the lower
whishi = q3 + 1.5*(q3-q1)
#create a dataframe where the listing count is higher than the upper whisker value
out_list = dat[dat['calculated_host_listings_count']>whishi]
print('The upper whisker on the box plot is at ' + str(whishi) + ', with ' +
str(len(out_list['id'])) + ' listings as outliers past this value.')
The upper whisker on the box plot is at 21.0, with 842 listings as outliers past this value.
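Since this Tukey-fence computation recurs for several variables below, it could be factored into a small helper; a sketch (not part of the original notebook):

```python
import numpy as np

def upper_whisker(values):
    """Tukey upper fence: Q3 + 1.5 * IQR."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    return q3 + 1.5 * (q3 - q1)

# Toy data: Q1 = 2.0, Q3 = 4.0, so the fence is 4 + 1.5 * 2 = 7.0
print(upper_whisker([1, 2, 3, 4, 100]))  # 7.0
```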
plt.figure(figsize=(6,6))
p = sb.countplot(x='license',data = dat)
p.bar_label(p.containers[0])
plt.xlabel('Listing Has License (1:Yes)')
plt.show()
A majority of the listings have a license, but it is not a large majority.
dat['host_response_time'].unique()
array(['within an hour', 'within a few hours', nan, 'within a day',
'a few days or more'], dtype=object)
plt.figure(figsize=(6,6))
p = sb.countplot(x='host_response_time',data = dat)
p.set_xticklabels(labels=p.get_xticklabels(),rotation=45)
p.bar_label(p.containers[0])
plt.xlabel('Host Response Time')
plt.show()
Here we can see there are four categories for the response time. Let's encode the response time as a float value, in hours, assigned as follows:
dat['host_response_time_float'] = dat['host_response_time'].map({'within an hour':1,'within a few hours':5,
'within a day':24,'a few days or more':48}).astype('float')
dat['host_response_time_float'].unique()
array([ 1., 5., nan, 24., 48.])
In order to simplify future analysis, let us count the number of verifications the host has and list this in a new column.
dat['host_verifications'].sample(5)
4000    ['email', 'phone', 'reviews', 'manual_offline'...
1640                 ['email', 'phone', 'reviews', 'kba']
4029    ['email', 'phone', 'offline_government_id', 's...
147     ['email', 'phone', 'facebook', 'reviews', 'off...
4886                                   ['email', 'phone']
Name: host_verifications, dtype: object
We can see that the verifications are separated by a comma, so we will use this to count the number of verifications each host has.
dat['no_of_verif'] = dat['host_verifications'].str.count(r',') + 1
dat.loc[:, ['host_verifications','no_of_verif']].head()
| | host_verifications | no_of_verif |
|---|---|---|
| 0 | ['email', 'phone', 'reviews', 'manual_offline'... | 6 |
| 1 | ['email', 'phone', 'reviews', 'jumio', 'offlin... | 8 |
| 2 | ['email', 'phone', 'reviews', 'jumio', 'govern... | 6 |
| 3 | ['email', 'phone', 'reviews', 'offline_governm... | 7 |
| 4 | ['email', 'phone', 'facebook', 'reviews', 'kba'] | 5 |
In order to catch any observations where hosts have no verifications, we will set the number of verifications to zero where the host_verifications = 'None'. This is important since the code above would have counted both 'none' and an observation without a comma (i.e. only one verification) as 1.
dat['no_of_verif'] = np.where(dat['host_verifications'] == 'None', 0, dat['no_of_verif'])
dat['no_of_verif'].describe()
count    6365.000000
mean        5.586646
std         2.312465
min         1.000000
25%         4.000000
50%         6.000000
75%         7.000000
max        12.000000
Name: no_of_verif, dtype: float64
On average, hosts have about 5-6 different identity verifications.
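Since 'host_verifications' holds a stringified Python list, an alternative to comma counting is to parse the string and take its length. An illustrative sketch (the `count_verifications` helper is hypothetical, and it assumes each value is either 'None' or a valid list literal):

```python
import ast

def count_verifications(value):
    # Treat the literal string 'None' (and missing values) as zero verifications
    if value in (None, 'None'):
        return 0
    return len(ast.literal_eval(value))

print(count_verifications("['email', 'phone', 'reviews']"))  # 3
print(count_verifications('None'))                           # 0
```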
plt.figure(figsize=(18,6))
p = sb.countplot(x='no_of_verif',data = dat)
p.bar_label(p.containers[0])
plt.xlabel('Number of Identity Verifications by Host')
plt.show()
Here we repeat the same process for number of amenities listed.
dat['no_of_amen'] = dat['amenities'].str.count(r',') + 1
dat.loc[:, ['amenities','no_of_amen']].head()
| | amenities | no_of_amen |
|---|---|---|
| 0 | ["Hot water kettle", "Wine glasses", "Kitchen"... | 51 |
| 1 | ["Kitchen", "Free street parking", "Long term ... | 30 |
| 2 | ["Kitchen", "Free street parking", "Shampoo", ... | 33 |
| 3 | ["Kitchen", "Long term stays allowed", "Smoke ... | 31 |
| 4 | ["Free street parking", "Shampoo", "Smoke alar... | 22 |
dat['no_of_amen'].describe()
count    6365.000000
mean       31.141241
std        10.613992
min         1.000000
25%        24.000000
50%        31.000000
75%        37.000000
max        82.000000
Name: no_of_amen, dtype: float64
plt.figure(figsize=(18,6))
p = sb.countplot(x='no_of_amen',data = dat)
p.set_xticklabels(labels=p.get_xticklabels(),rotation=90)
plt.xlabel('Number of Amenities Listed by the Host')
plt.show()
Let us take a closer look at the outliers in this variable.
plt.figure(figsize=(18,6))
sb.boxplot(x=dat['no_of_amen'])
plt.xlabel('Number of Amenities Listed by the Host')
plt.show()
#Calculate the lower and upper quartile values
q1 = np.quantile(dat['no_of_amen'], 0.25)
q3 = np.quantile(dat['no_of_amen'], 0.75)
#calculate both the upper and lower whisker values
whishi = q3 + 1.5*(q3-q1)
whislo = q1 - 1.5*(q3-q1)
#create dataframes where the number of amenities is above/below the whisker values
out_list_hi = dat[dat['no_of_amen']>whishi]
out_list_lo = dat[dat['no_of_amen']<whislo]
print('\nThe upper whisker on the box plot is at ' + str(whishi) + ', with ' +
str(len(out_list_hi['id'])) + ' listings as outliers above this value.')
print('\n')
print('The lower whisker on the box plot is at ' + str(whislo) + ', with ' +
str(len(out_list_lo['id'])) + ' listings as outliers below this value.')
The upper whisker on the box plot is at 56.5, with 109 listings as outliers above this value.

The lower whisker on the box plot is at 4.5, with 11 listings as outliers below this value.
Let's calculate the distance of the listings from the center of the city of Chicago. We will use the following coordinates: 41.8818° N, 87.6232° W.
We will use the Haversine formula to calculate the distance in miles. In order to do so, we will first define a function to perform the calculation.
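For reference, the Haversine distance between points $(\varphi_1, \lambda_1)$ and $(\varphi_2, \lambda_2)$ on a sphere of radius $r$ is:

$$d = 2r \arcsin\!\left(\sqrt{\sin^2\!\frac{\varphi_2 - \varphi_1}{2} + \cos\varphi_1 \cos\varphi_2 \sin^2\!\frac{\lambda_2 - \lambda_1}{2}}\right)$$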
# Coordinates for center of Chicago in degrees
lat1 = 41.881832
long1 = -87.623177
# Constant, radius of the Earth in miles
r = 3958.8
# Define a function to calculate the distance
def haversine(lat2, long2):
    # First convert degrees into radians:
    rlat1 = lat1 * (math.pi / 180)
    rlat2 = lat2 * (math.pi / 180)
    rlong1 = long1 * (math.pi / 180)
    rlong2 = long2 * (math.pi / 180)
    # Calculate the difference between the latitudes and longitudes
    dlat = rlat1 - rlat2
    dlong = rlong1 - rlong2
    # Use the Haversine formula (broken into 3 terms for simplification here)
    a = (math.sin(dlat / 2) ** 2)  # First term
    b = math.cos(rlat1) * math.cos(rlat2)  # Second term
    c = (math.sin(dlong / 2) ** 2)  # Third term
    e = math.asin(math.sqrt(a + b * c))  # arcsin completes the Haversine formula
    d = 2 * r * e  # where r is the radius of the Earth
    return d
Then, apply the formula to each observation in the data set, returning the answer in a new column for the distance from the center of the city.
dat['d_center'] = dat.apply(
lambda row: haversine(row['latitude'], row['longitude']),
axis=1)
dat['d_center'].sample(5)
2448    4.601232
639     3.277503
1914    4.747726
2825    4.856196
2913    4.685496
Name: d_center, dtype: float64
dat['d_center'].describe()
count    6365.000000
mean        4.382709
std         2.735817
min         0.091059
25%         2.182025
50%         4.095625
75%         6.078114
max        16.529282
Name: d_center, dtype: float64
plt.figure(figsize=(18,6))
p = sb.histplot(x='d_center',data = dat,binwidth=0.25)
plt.xlabel('Listing distance from center of Chicago (miles)')
plt.show()
The plot is skewed towards the center of the city, which makes sense: a larger number of listings are located closer to the center.
plt.figure(figsize=(18,6))
sb.boxplot(x=dat['d_center'])
plt.xlabel('Distance from center of Chicago')
plt.show()
#Calculate the lower and upper quartile values
q1 = np.quantile(dat['d_center'], 0.25)
q3 = np.quantile(dat['d_center'], 0.75)
#calculate both the upper and lower whisker values
whishi = q3 + 1.5*(q3-q1)
whislo = q1 - 1.5*(q3-q1)
#create dataframes where the distance is above/below the whisker values
out_list_hi = dat[dat['d_center']>whishi]
out_list_lo = dat[dat['d_center']<whislo]
print('\nThe upper whisker on the box plot is at ' + str(round(whishi,2)) + ', with ' +
str(len(out_list_hi['id'])) + ' listings as outliers above this value.')
The upper whisker on the box plot is at 11.92, with 64 listings as outliers above this value.
plt.figure(figsize=(18,6))
p = sb.countplot(x='property_type',data = dat)
p.set_xticklabels(labels=p.get_xticklabels(),rotation=90)
plt.xlabel('Property Type')
plt.show()
plt.figure(figsize=(12,6))
p = sb.countplot(x='room_type',data = dat)
plt.xlabel('Room Type')
plt.show()
print("The two graphs above show the most common type of listing is an entire apartment, " +
      "with the largest property type being 'Entire apartment' at %s observations. \n"
      %(dat['property_type'].values == 'Entire apartment').sum())
print("In the room type category, there are %s observations for 'Entire home/apt'"
%(dat['room_type'].values == 'Entire home/apt').sum())
The two graphs above show the most common type of listing is an entire apartment, with the largest property type being 'Entire apartment' at 2898 observations.

In the room type category, there are 4541 observations for 'Entire home/apt'
plt.figure(figsize=(18,6))
sb.boxplot(x=dat['price'])
plt.xlabel('Rental Cost')
plt.show()
#Calculate the lower and upper quartile values
q1 = np.quantile(dat['price'], 0.25)
q3 = np.quantile(dat['price'], 0.75)
#calculate the upper whisker only, since no outliers are below the lower
whishi = q3 + 1.5*(q3-q1)
#create a dataframe where the price is higher than the upper whisker value
out_list_hi = dat[dat['price']>whishi]
print('\nThe upper whisker on the box plot is at $' + str(whishi) + ', with ' +
str(len(out_list_hi['id'])) + ' listings as outliers above this value.')
The upper whisker on the box plot is at $380.0, with 463 listings as outliers above this value.
dat['price'].describe()
count    6365.000000
mean      159.278240
std       137.654474
min         0.000000
25%        75.000000
50%       120.000000
75%       197.000000
max       999.000000
Name: price, dtype: float64
plt.figure(figsize=(12,8))
sb.regplot(x="d_center", y="price",
line_kws={"color":"r","alpha":0.5,"lw":3}, data=dat)
plt.xlabel('Distance from center of city (miles)')
plt.ylabel('Price ($)')
plt.show()
The regression plot shows a correlation between the distance from the center of the city and the price of the listing. In general, the listings closer to the center of the city are worth more than the ones further away. This makes sense, as property values are generally higher in more populated areas.
We can use these data fields to predict price based on distance from the city center.
y = dat['price']
X = pd.DataFrame({'Distance':dat['d_center'],
'Const':1})
mod = sm.OLS(y,X)
fit = mod.fit()
fit.summary()
| Dep. Variable: | price | R-squared: | 0.064 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.064 |
| Method: | Least Squares | F-statistic: | 435.2 |
| Date: | Sun, 05 Dec 2021 | Prob (F-statistic): | 1.50e-93 |
| Time: | 21:13:12 | Log-Likelihood: | -40166. |
| No. Observations: | 6365 | AIC: | 8.034e+04 |
| Df Residuals: | 6363 | BIC: | 8.035e+04 |
| Df Model: | 1 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Distance | -12.7313 | 0.610 | -20.863 | 0.000 | -13.928 | -11.535 |
| Const | 215.0760 | 3.153 | 68.218 | 0.000 | 208.895 | 221.256 |
| Omnibus: | 3338.804 | Durbin-Watson: | 1.816 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 23932.376 |
| Skew: | 2.452 | Prob(JB): | 0.00 |
| Kurtosis: | 11.136 | Cond. No. | 10.0 |
Based on the above OLS model, we can say that on average a property that is 1 mile further from the center of the city than another would be $12.73/night cheaper.
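Using the coefficients reported in the summary, a listing's expected nightly price at a given distance can be sketched directly (the coefficient values below are copied from the table above):

```python
# Coefficients from the fitted OLS summary above
slope, intercept = -12.7313, 215.0760

# Expected nightly price at a few distances from the city center
for miles in (1, 5, 10):
    print(miles, round(intercept + slope * miles, 2))
```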
plt.figure(figsize=(12,8))
sb.regplot(x="no_of_amen", y="price",
line_kws={"color":"r","alpha":0.5,"lw":3}, data=dat)
plt.xlabel('Number of Amenities Listed')
plt.ylabel('Price ($)')
plt.show()
On average, the listings with more amenities listed are also posted for a higher rental price.
Again, we can use this data to develop a regression model to predict a property's price based on the number of amenities that are listed in the field.
y = dat['price']
X = pd.DataFrame({'Number of Amenities':dat['no_of_amen'],
'Const':1})
mod = sm.OLS(y,X)
fit = mod.fit()
fit.summary()
| Dep. Variable: | price | R-squared: | 0.018 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.018 |
| Method: | Least Squares | F-statistic: | 119.7 |
| Date: | Sun, 05 Dec 2021 | Prob (F-statistic): | 1.29e-27 |
| Time: | 21:13:13 | Log-Likelihood: | -40318. |
| No. Observations: | 6365 | AIC: | 8.064e+04 |
| Df Residuals: | 6363 | BIC: | 8.065e+04 |
| Df Model: | 1 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Number of Amenities | 1.7624 | 0.161 | 10.941 | 0.000 | 1.447 | 2.078 |
| Const | 104.3953 | 5.299 | 19.699 | 0.000 | 94.007 | 114.784 |
| Omnibus: | 3258.084 | Durbin-Watson: | 1.778 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 21967.585 |
| Skew: | 2.403 | Prob(JB): | 0.00 |
| Kurtosis: | 10.729 | Cond. No. | 102. |
We can use this model to make predictions for new properties. For example, if a property lists 35 amenities, we predict a listing price of about (1.76 × 35) + 104.40 ≈ $166/night.
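The same back-of-the-envelope prediction can be scripted so it is easy to reuse for other amenity counts (coefficients copied from the summary above; the function name is ours):

```python
# Coefficients from the amenities OLS summary above
coef_amen = 1.7624   # $/night per additional listed amenity
const = 104.3953     # baseline price with zero amenities

def predicted_price(n_amenities):
    """Predicted nightly price ($) given the number of listed amenities."""
    return const + coef_amen * n_amenities

# 35 amenities -> about $166/night, matching the hand calculation above
print(round(predicted_price(35)))
```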
plt.figure(figsize=(12,8))
sb.regplot(x="host_response_time_float", y="avg_rating",
line_kws={"color":"r","alpha":0.5,"lw":3}, data=dat)
plt.xlabel('Host Response Time (hours)')
plt.ylabel('Average Rating (out of 5)')
plt.show()
The regression plot shows a negative trend in the average rating related to the time it takes the host to respond to messages and booking requests.
plt.figure(figsize=(12,8))
sb.regplot(x="host_response_rate", y="avg_rating",
line_kws={"color":"r","alpha":0.5,"lw":3}, data=dat)
plt.xlabel('Host Response Rate (%)')
plt.ylabel('Average Rating (out of 5)')
plt.show()
In general, the higher the hosts' response rate is, the higher their average listing rating is.
plt.figure(figsize=(12,8))
sb.regplot(x="no_of_verif", y="avg_rating",
line_kws={"color":"r","alpha":0.5,"lw":3}, data=dat)
plt.xlabel('Number of Host Identity Verifications')
plt.ylabel('Average Rating (out of 5)')
plt.show()
While it appears there is a positive correlation between the average rating and the number of verifications a host has, it is very small.
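The strength of that relationship can be quantified with a Pearson correlation coefficient. A minimal sketch using pandas follows; the `demo` values here are hypothetical, and in the actual analysis this would be `dat['no_of_verif'].corr(dat['avg_rating'])`:

```python
import pandas as pd

# Hypothetical illustrative data standing in for the real columns
demo = pd.DataFrame({
    "no_of_verif": [3, 5, 2, 7, 4, 6, 3, 5],
    "avg_rating":  [4.6, 4.5, 4.7, 4.8, 4.4, 4.7, 4.5, 4.6],
})

# Series.corr computes the Pearson correlation by default
r = demo["no_of_verif"].corr(demo["avg_rating"])
print(round(r, 3))  # a weak positive correlation for this toy data
```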
Some of the columns are free text entered by the hosts. We will generate word clouds to explore which keywords are used most frequently. The data fields we will focus on are: description, neighborhood_overview, and host_about.
First, let's define a function to make it easy to generate our wordclouds from the input column:
stopwords = STOPWORDS
# Ignore Series-repr artifacts and truncated word fragments
stopwords.update(['description','neighborhood_overview','object','dtype','host_about',
                  'inv','t','name','don','m','o','neig','nei'])

def Mywordcloud(data, title=None):
    # Join all entries into one string; calling str() on a Series would
    # only capture its truncated repr, not the full text
    text = ' '.join(data.astype(str))
    wc = WordCloud(
        background_color="white",
        stopwords=stopwords,
        height=600,
        width=600
    ).generate(text)
    fig = plt.figure(1, figsize=(10, 10))
    plt.axis('off')
    if title:
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)
    plt.imshow(wc)
    plt.show()
Mywordcloud(dat['description'].dropna())
In the description field, many of the keywords we would expect appear. Interestingly, the word "vaccinated" has already made it into the list of most-used words.
Mywordcloud(dat['neighborhood_overview'].dropna())
In the neighborhood overview there are, once again, many words you would expect when describing a location. "Ukrainian" likely refers to Ukrainian Village, a neighborhood in Chicago, and "Lasalle" to LaSalle Street.
Mywordcloud(dat['host_about'].dropna())
Once again here we see common words to describe hosts, along with some of the more popular names in the area.
non_zero_avg = dat[dat['avg_rating']>0]
ax = sb.jointplot(x="d_center", y="avg_rating", data=non_zero_avg[['d_center','avg_rating']],color='b')
ax.plot_joint(sb.kdeplot, zorder=0, n_levels=6)
plt.xlabel('Distance from the center of Chicago')
plt.ylabel('Average Rating (out of 5)')
plt.show()
According to the plot above, customers tend to give the places close to the center of Chicago ratings between 4 and 5.
ax = sb.jointplot(x="accommodates", y="avg_rating", data=non_zero_avg[['accommodates','avg_rating']],color='g')
ax.plot_joint(sb.kdeplot, zorder=0, n_levels=6)
plt.xlabel('Accommodates (people)')
plt.ylabel('Average Rating (out of 5)')
plt.show()
Smaller accommodations tend to have higher average rating scores.
ax1 = sb.jointplot(x="availability_30", y="accommodates", data=dat[['availability_30','accommodates']],color='b')
ax1.plot_joint(sb.kdeplot, zorder=0, n_levels=6)
plt.xlabel('Availability Next 30d')
plt.ylabel('Accommodates (people)')
plt.show()
In the plot above, the density is concentrated in the bottom-left corner and spreads up and to the right. Smaller accommodations tend to have fewer available days over the next 30 days, i.e., they are booked more often.
ax = sb.jointplot(x="d_center", y="accommodates", data=dat[['d_center','accommodates']],color='g')
ax.plot_joint(sb.kdeplot, zorder=0, n_levels=17)
plt.xlabel('Distance from the center of Chicago')
plt.ylabel('Accommodates (people)')
plt.show()
We can see that smaller accommodations tend to be located closer to the city center.
Geospatial visualization
The geography of Chicago spreads the city across downtown and suburban areas, and rental prices generally rise as properties get closer to downtown. We will consider the following variables for observing how prices vary across locations.
We take the 'latitude' and 'longitude' variables for identifying the property location hotspots, and 'price' variable to define the intensity of high and low priced rentals.
The Coordinates channel uses the location field, which contains arrays of latitude-longitude pairs.
The Intensity field uses the price field, which contains rental price for each property.
# Mark the location of each property on the map of Chicago
chicago_map = folium.Map([41.881832, -87.623177], zoom_start=10)
for index, row in dat.iterrows():
    folium.CircleMarker([row['latitude'], row['longitude']],
                        radius=5,
                        popup=row['neighbourhood_cleansed'], color='#7ECC49',
                        fill_color="#FFFF00").add_to(chicago_map)
chicago_map